<a href="https://github.com/dd-consulting">
<img src="../reference/GZ_logo.png" width="60" align="right">
</a>
<h1>
One-Stop Analytics: Exploratory Data Analysis (EDA)
</h1>
Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges. CDC is committed to continuing to provide essential data on ASD, search for factors that put children at risk for ASD and possible causes, and develop resources that help identify children with ASD as early as possible.
Doctors cited better awareness among parents and preschool teachers, leading to early referrals for diagnosis.
https://www.gov.sg/news/content/today-online-more-preschoolers-diagnosed-with-developmental-issues
https://www.cdc.gov/ncbddd/autism/data/index.html
Obtain current R working directory
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
Set new R working directory
# setwd("/media/sf_vm_shared_folder/git/DDC/DDC-ASD/model_R")
# setwd('~/Desktop/admin-desktop/vm_shared_folder/git/DDC-ASD/model_R')
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
Read in CSV data, storing as R dataframe
# Read back in above saved file:
ASD_National <- read.csv("../dataset/ADV_ASD_National_R.csv")
# Convert Year_Factor to ordered.factor
ASD_National$Year_Factor <- factor(ASD_National$Year_Factor, ordered = TRUE)
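An ordered factor keeps the years in sequence and allows comparisons such as `<` and `max()`. A minimal sketch on a toy vector follows; the values and the name `years_demo` are illustrative assumptions, not taken from the CSV file:

```r
# Toy vector of survey years (illustrative values only)
years_demo <- c(2004, 2000, 2002, 2004)

# Unordered factor: levels are sorted, but comparisons are not meaningful
years_unordered <- factor(years_demo)

# Ordered factor: levels are sorted and can be compared
years_ordered <- factor(years_demo, ordered = TRUE)

levels(years_ordered)                 # levels in ascending order
is.ordered(years_ordered)             # TRUE for an ordered factor
years_ordered[2] < years_ordered[1]   # is 2000 before 2004?
max(years_ordered)                    # latest year, returned as a factor level
```

The same comparisons on `years_unordered` would return `NA` with a warning, which is the practical reason for converting `Year_Factor`.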
<h3>
EDA - Summarization - High Level Data Summary
</h3>
summary(ASD_National)
## Source Year Prevalence Upper.CI Lower.CI
## addm: 8 Min. :2000 Min. : 1.800 Min. : 1.800 Min. : 1.700
## medi:13 1st Qu.:2004 1st Qu.: 3.950 1st Qu.: 3.950 1st Qu.: 3.875
## nsch: 4 Median :2008 Median : 6.650 Median : 6.900 Median : 6.350
## sped:17 Mean :2007 Mean : 7.952 Mean : 8.207 Mean : 7.712
## 3rd Qu.:2011 3rd Qu.: 9.725 3rd Qu.:10.350 3rd Qu.: 9.625
## Max. :2016 Max. :29.200 Max. :30.700 Max. :27.700
##
## Source_Full1
## Autism & Developmental Disabilities Monitoring Network: 8
## Medicaid :13
## National Survey of Children's Health : 4
## Special Education Child Count :17
##
##
##
## Source_Full2
## addm-Autism & Developmental Disabilities Monitoring Network: 8
## medi-Medicaid :13
## nsch-National Survey of Children's Health : 4
## sped-Special Education Child Count :17
##
##
##
## Male.Prevalence Male.Lower.CI Male.Upper.CI Female.Prevalence
## Min. :11.50 Min. :12.20 Min. :13.70 Min. :2.700
## 1st Qu.:13.70 1st Qu.:14.85 1st Qu.:16.07 1st Qu.:3.050
## Median :18.40 Median :20.20 Median :21.55 Median :4.000
## Mean :18.71 Mean :19.22 Mean :20.62 Mean :4.271
## 3rd Qu.:23.55 3rd Qu.:22.93 3rd Qu.:24.32 3rd Qu.:5.250
## Max. :26.60 Max. :25.80 Max. :27.40 Max. :6.600
## NA's :35 NA's :36 NA's :36 NA's :35
## Female.Lower.CI Female.Upper.CI Non.hispanic.white.Prevalence
## Min. :2.600 Min. :3.300 Min. : 7.70
## 1st Qu.:3.100 1st Qu.:3.700 1st Qu.: 9.80
## Median :4.300 Median :4.950 Median :12.00
## Mean :4.217 Mean :4.900 Mean :12.51
## 3rd Qu.:4.975 3rd Qu.:5.675 3rd Qu.:15.55
## Max. :6.200 Max. :7.000 Max. :17.20
## NA's :36 NA's :36 NA's :35
## Non.hispanic.white.Lower.CI Non.hispanic.white.Upper.CI
## Min. : 9.100 Min. :10.40
## 1st Qu.: 9.925 1st Qu.:10.93
## Median :13.100 Median :14.20
## Mean :12.733 Mean :13.88
## 3rd Qu.:15.075 3rd Qu.:16.20
## Max. :16.500 Max. :17.80
## NA's :36 NA's :36
## Non.hispanic.black.Prevalence Non.hispanic.black.Lower.CI
## Min. : 6.50 Min. : 6.200
## 1st Qu.: 7.05 1st Qu.: 7.325
## Median :10.20 Median :10.500
## Mean :10.31 Mean :10.200
## 3rd Qu.:12.70 3rd Qu.:12.100
## Max. :16.00 Max. :15.100
## NA's :35 NA's :36
## Non.hispanic.black.Upper.CI Hispanic.Prevalence Hispanic.Lower.CI
## Min. : 7.600 Min. : 5.900 Min. : 5.000
## 1st Qu.: 8.575 1st Qu.: 6.625 1st Qu.: 5.775
## Median :12.000 Median : 9.000 Median : 8.300
## Mean :11.700 Mean : 9.150 Mean : 8.333
## 3rd Qu.:13.700 3rd Qu.:10.625 3rd Qu.: 9.850
## Max. :16.900 Max. :14.000 Max. :13.100
## NA's :36 NA's :36 NA's :36
## Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## Min. : 6.600 Min. : 9.70
## 1st Qu.: 7.775 1st Qu.:10.97
## Median : 9.750 Median :11.85
## Mean :10.017 Mean :11.72
## 3rd Qu.:11.425 3rd Qu.:12.60
## Max. :14.900 Max. :13.50
## NA's :36 NA's :38
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## Min. : 8.10 Min. :11.60
## 1st Qu.: 9.45 1st Qu.:12.72
## Median :10.30 Median :13.65
## Mean :10.12 Mean :13.57
## 3rd Qu.:10.97 3rd Qu.:14.50
## Max. :11.80 Max. :15.40
## NA's :38 NA's :38
## Source_UC Source_Full3
## ADDM: 8 ADDM Autism & Developmental Disabilities Monitoring Network: 8
## MEDI:13 MEDI Medicaid :13
## NSCH: 4 NSCH National Survey of Children's Health : 4
## SPED:17 SPED Special Education Child Count :17
##
##
##
## Prevalence_Risk2 Prevalence_Risk4 Year_Factor
## High:28 High : 8 2004 : 4
## Low :14 Low :14 2008 : 4
## Medium :18 2012 : 4
## Very High: 2 2000 : 3
## 2002 : 3
## 2006 : 3
## (Other):21
if(!require(ggplot2)){install.packages("ggplot2")}
## Loading required package: ggplot2
library(ggplot2)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] Explore the Data</span>
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Explore the Data</span>
</h3>
# ----------------------------------
# [National] < Years Data Available >
# ----------------------------------
p = ggplot(ASD_National, aes(x = 1, fill = Source)) +
geom_bar() + theme(axis.text.x=element_blank(), # Hide axis
axis.ticks.x=element_blank(), # Hide axis
axis.text.y=element_blank(), # Hide axis
axis.ticks.y=element_blank(), # Hide axis
panel.background = element_blank(), # Remove panel background
legend.position="top"
) +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
labs(x="", y="", title="Years Data Available") + # layers of graphics
facet_grid(facets = Source~Year)
# Show plot
p
<h3>
Data Visualisation (Enhanced) - Barplot
</h3>
# Create bar chart using R graphics
barplot(table(ASD_National$Source))
# Create bar chart using ggplot2
ggplot(ASD_National, aes(x = Source)) + geom_bar(fill = "blue", alpha=0.5)
# Use color to differentiate sub-group data (Year)
ggplot(ASD_National, aes(x = Source, fill = factor(Year))) + geom_bar() +
theme(legend.position="top") + labs(fill = "Legend: Year")
# Split chart to multiple columns by using: facets = . ~ Year
ggplot(ASD_National, aes(x = Source, fill = Source)) + geom_bar() +
theme(legend.position="top") +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
facet_grid(facets = . ~ Year)
# Split chart to multiple rows and columns by using: facets = Source ~ Year
ggplot(ASD_National, aes(x = Source, fill = Source)) + geom_bar() +
theme(legend.position="top") +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
facet_grid(facets = Source~Year)
The chart above is now very similar to the earlier [National] < Years Data Available > chart.
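A Source ~ Year facet grid is essentially a picture of a cross-tabulation: each panel holds the records for one source and one year. The sketch below shows the same idea with base R `table()`. The data frame `demo` and its values are illustrative assumptions, not rows from the dataset:

```r
# Toy data: one row per (source, year) record; values are illustrative
demo <- data.frame(
  Source = c("addm", "addm", "medi", "nsch", "sped", "sped"),
  Year   = c(2000, 2002, 2000, 2003, 2000, 2002)
)

# Count records for every Source x Year combination
availability <- table(demo$Source, demo$Year)
availability

# A count of 0 corresponds to an empty facet panel
availability["nsch", "2000"]
```

Running `table(ASD_National$Source, ASD_National$Year)` on the real data should give the counts behind the facet charts.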
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Prevalence by Data Sources and Risk Levels</span>
</h3>
# Use colour to differentiate sub-group data (risk level)
ggplot(ASD_National, aes(x = Source, fill = Prevalence_Risk4)) +
geom_bar(alpha=0.95, position = position_stack(reverse = TRUE)) + # Reverse default stacking order
scale_fill_manual("Legend: Risk", values = c("Low" = "lightyellow", # Scale name sets the legend title
"Medium" = "orange",
"High" = "red",
"Very High" = "darkred")) +
labs(x="Data Sources", y="Occurrences", title="Prevalence by Data Sources and Risk Levels") + # layers of graphics
theme(legend.position="top")
Barplot / Column plot
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] REPORTED PREVALENCE VARIES BY SEX</span>
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] [ Year: 2014 ]
</h3>
# Filter only data of ADDM
ASD_National_ADDM <- subset(ASD_National, Source == 'addm')
#
ASD_National_ADDM
## Source Year Prevalence Upper.CI Lower.CI
## 1 addm 2000 6.7 7.0 6.3
## 2 addm 2002 6.6 6.8 6.3
## 3 addm 2004 8.0 8.4 7.6
## 4 addm 2006 9.0 9.3 8.6
## 5 addm 2008 11.3 11.7 11.0
## 6 addm 2010 14.7 15.1 14.3
## 7 addm 2012 14.8 15.2 14.4
## 8 addm 2014 16.8 17.3 16.4
## Source_Full1
## 1 Autism & Developmental Disabilities Monitoring Network
## 2 Autism & Developmental Disabilities Monitoring Network
## 3 Autism & Developmental Disabilities Monitoring Network
## 4 Autism & Developmental Disabilities Monitoring Network
## 5 Autism & Developmental Disabilities Monitoring Network
## 6 Autism & Developmental Disabilities Monitoring Network
## 7 Autism & Developmental Disabilities Monitoring Network
## 8 Autism & Developmental Disabilities Monitoring Network
## Source_Full2 Male.Prevalence
## 1 addm-Autism & Developmental Disabilities Monitoring Network NA
## 2 addm-Autism & Developmental Disabilities Monitoring Network 11.5
## 3 addm-Autism & Developmental Disabilities Monitoring Network 12.9
## 4 addm-Autism & Developmental Disabilities Monitoring Network 14.5
## 5 addm-Autism & Developmental Disabilities Monitoring Network 18.4
## 6 addm-Autism & Developmental Disabilities Monitoring Network 23.7
## 7 addm-Autism & Developmental Disabilities Monitoring Network 23.4
## 8 addm-Autism & Developmental Disabilities Monitoring Network 26.6
## Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1 NA NA NA NA NA
## 2 NA NA 2.7 NA NA
## 3 12.2 13.7 2.9 2.6 3.3
## 4 13.9 15.1 3.2 2.9 3.5
## 5 17.7 19.0 4.0 3.7 4.3
## 6 23.0 24.4 5.3 5.0 5.7
## 7 22.7 24.1 5.2 4.9 5.6
## 8 25.8 27.4 6.6 6.2 7.0
## Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1 NA NA
## 2 7.7 NA
## 3 9.7 9.1
## 4 9.9 9.4
## 5 12.0 11.5
## 6 15.8 15.2
## 7 15.3 14.7
## 8 17.2 16.5
## Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1 NA NA
## 2 NA 6.5
## 3 10.4 6.9
## 4 10.4 7.2
## 5 12.5 10.2
## 6 16.3 12.3
## 7 15.9 13.1
## 8 17.8 16.0
## Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1 NA NA NA
## 2 NA NA NA
## 3 6.2 7.6 6.2
## 4 6.6 7.8 5.9
## 5 9.5 10.9 7.9
## 6 11.5 13.1 10.8
## 7 12.3 13.9 10.1
## 8 15.1 16.9 14.0
## Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1 NA NA NA
## 2 NA NA NA
## 3 5.0 7.5 NA
## 4 5.3 6.6 NA
## 5 7.2 8.6 9.7
## 6 10.0 11.6 12.3
## 7 9.4 10.9 11.4
## 8 13.1 14.9 13.5
## Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 8.1 11.6
## 6 10.7 14.2
## 7 9.9 13.1
## 8 11.8 15.4
## Source_UC Source_Full3
## 1 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 2 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 3 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 4 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 5 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 6 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 7 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 8 ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## Prevalence_Risk2 Prevalence_Risk4 Year_Factor
## 1 High Medium 2000
## 2 High Medium 2002
## 3 High Medium 2004
## 4 High Medium 2006
## 5 High High 2008
## 6 High High 2010
## 7 High High 2012
## 8 High High 2014
# Construct a new re-shaped dataframe of [ Source: ADDM ] [Year: 2014]
#
Process_Source = 'addm'
Process_Year = 2014
Define a function to create a re-shaped dataframe:
Function_Reshape_ASD_National_ADDM <- function(Process_Source, Process_Year) {
  # Note: reads the global dataframe ASD_National_ADDM.
  # Process_Source is only used to label the rows; it does not filter the data.
  # Create the vectors:
  Sex.Group = c('Overall',
                'Boys',
                'Girls')
  Prevalence = c(ASD_National_ADDM$Prevalence[ASD_National_ADDM$Year == Process_Year],
                 ASD_National_ADDM$Male.Prevalence[ASD_National_ADDM$Year == Process_Year],
                 ASD_National_ADDM$Female.Prevalence[ASD_National_ADDM$Year == Process_Year])
  # Combine all the vectors into a data frame:
  ASD_National_ADDM_Reshaped_DF = data.frame(Sex.Group, Prevalence, stringsAsFactors = TRUE)
  # Add new columns:
  ASD_National_ADDM_Reshaped_DF$Source = Process_Source
  ASD_National_ADDM_Reshaped_DF$Year = Process_Year
  return(ASD_National_ADDM_Reshaped_DF) # Return a dataframe
}
Use defined function Function_Reshape_ASD_National_ADDM( ) for a specific year:
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2014)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 16.8 addm 2014
## 2 Boys 26.6 addm 2014
## 3 Girls 6.6 addm 2014
Visualise: Prevalence Estimates by Sex [ Source: ADDM ] [ Year: 2014 ]
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
ggplot(ASD_National_ADDM_Reshaped_DF, aes(Sex.Group, Prevalence)) +
geom_col(aes(fill = Sex.Group), alpha=0.5) + # Use column chart
geom_text(aes(label = Prevalence), vjust = +0.75, hjust = -0.2, size = 3) +
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_discrete(name = "") +
scale_fill_manual("Sex Group:", values = c("Overall" = "purple",
"Boys" = "blue",
"Girls" = "orange")) +
ggtitle("Prevalence Estimates by Sex [ Source: ADDM ] [ Year: 2014 ]") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"),
legend.position = 'none') +
coord_flip() # Rotate chart
# facet_grid(facets = Year ~ .)
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] [ Year: ALL ]
</h3>
# Create a new dataframe to hold re-shaped data for all years.
ASD_National_ADDM_Reshaped_DF_All = ASD_National_ADDM_Reshaped_DF # Loaded with initial [ Year: 2014 ] data
Process_Source = 'addm'
unique(ASD_National_ADDM$Year)
## [1] 2000 2002 2004 2006 2008 2010 2012 2014
Use defined function Function_Reshape_ASD_National_ADDM( ) for ALL remaining years:
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2012)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 14.8 addm 2012
## 2 Boys 23.4 addm 2012
## 3 Girls 5.2 addm 2012
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2010)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 14.7 addm 2010
## 2 Boys 23.7 addm 2010
## 3 Girls 5.3 addm 2010
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2008)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 11.3 addm 2008
## 2 Boys 18.4 addm 2008
## 3 Girls 4.0 addm 2008
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2006)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 9.0 addm 2006
## 2 Boys 14.5 addm 2006
## 3 Girls 3.2 addm 2006
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2004)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 8.0 addm 2004
## 2 Boys 12.9 addm 2004
## 3 Girls 2.9 addm 2004
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2002)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 6.6 addm 2002
## 2 Boys 11.5 addm 2002
## 3 Girls 2.7 addm 2002
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2000)
ASD_National_ADDM_Reshaped_DF
## Sex.Group Prevalence Source Year
## 1 Overall 6.7 addm 2000
## 2 Boys NA addm 2000
## 3 Girls NA addm 2000
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
# Re-shaped ADDM data for ALL years:
ASD_National_ADDM_Reshaped_DF_All
## Sex.Group Prevalence Source Year
## 1 Overall 16.8 addm 2014
## 2 Boys 26.6 addm 2014
## 3 Girls 6.6 addm 2014
## 4 Overall 14.8 addm 2012
## 5 Boys 23.4 addm 2012
## 6 Girls 5.2 addm 2012
## 7 Overall 14.7 addm 2010
## 8 Boys 23.7 addm 2010
## 9 Girls 5.3 addm 2010
## 10 Overall 11.3 addm 2008
## 11 Boys 18.4 addm 2008
## 12 Girls 4.0 addm 2008
## 13 Overall 9.0 addm 2006
## 14 Boys 14.5 addm 2006
## 15 Girls 3.2 addm 2006
## 16 Overall 8.0 addm 2004
## 17 Boys 12.9 addm 2004
## 18 Girls 2.9 addm 2004
## 19 Overall 6.6 addm 2002
## 20 Boys 11.5 addm 2002
## 21 Girls 2.7 addm 2002
## 22 Overall 6.7 addm 2000
## 23 Boys NA addm 2000
## 24 Girls NA addm 2000
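The year-by-year calls above can also be written as a loop. The following is a minimal sketch, not the notebook's own method: `addm_demo` is a toy stand-in for `ASD_National_ADDM` built from three of the rows printed earlier, and `reshape_one_year` is a hypothetical helper that takes the data as an argument instead of reading a global variable.

```r
# Toy stand-in for ASD_National_ADDM, using three rows shown in the output above
addm_demo <- data.frame(
  Year              = c(2000, 2012, 2014),
  Prevalence        = c(6.7, 14.8, 16.8),
  Male.Prevalence   = c(NA, 23.4, 26.6),
  Female.Prevalence = c(NA, 5.2, 6.6)
)

# Hypothetical helper: reshape one year into three rows (Overall, Boys, Girls)
reshape_one_year <- function(data, process_source, process_year) {
  row <- data[data$Year == process_year, ]
  data.frame(
    Sex.Group  = factor(c("Overall", "Boys", "Girls")),
    Prevalence = c(row$Prevalence, row$Male.Prevalence, row$Female.Prevalence),
    Source     = process_source,
    Year       = process_year
  )
}

# Reshape every year, newest first, then stack the pieces with rbind()
years_desc   <- sort(unique(addm_demo$Year), decreasing = TRUE)
pieces       <- lapply(years_desc, function(y) reshape_one_year(addm_demo, "addm", y))
reshaped_all <- do.call(rbind, pieces)
reshaped_all
```

Passing the data frame as an argument keeps the helper independent of global state, and `do.call(rbind, pieces)` replaces the repeated append step.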
Visualise: Prevalence Estimates by Sex [ Source: ADDM ] [ Year: ALL ]
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
ggplot(ASD_National_ADDM_Reshaped_DF_All, aes(Sex.Group, Prevalence)) +
geom_col(aes(fill = Sex.Group), alpha=0.75) + # Use column chart
geom_text(aes(label = Prevalence), vjust = +0.5, hjust = -0.2, size = 2.5) +
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_discrete(name = "") +
scale_fill_manual("Sex Group:", values = c("Overall" = "purple",
"Boys" = "blue",
"Girls" = "orange")) +
ggtitle("Prevalence Estimates by Sex [ Source: ADDM ] [ Year: ALL ]") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"),
legend.position = 'none') +
coord_flip() + # Rotate chart
facet_grid(facets = Year ~ .)
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (geom_text).
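The two warnings correspond to the two `NA` rows for 2000 (Boys and Girls) in the re-shaped data. One option, sketched here on toy data, is to count and drop those rows before plotting so the omission is explicit. The name `reshaped_demo` is an illustrative assumption; its values are copied from the 2002 and 2000 rows above.

```r
# Toy copy of the 2002 and 2000 rows from the re-shaped data
reshaped_demo <- data.frame(
  Sex.Group  = c("Overall", "Boys", "Girls", "Overall", "Boys", "Girls"),
  Prevalence = c(6.6, 11.5, 2.7, 6.7, NA, NA),
  Year       = c(2002, 2002, 2002, 2000, 2000, 2000)
)

# Count the rows that ggplot2 would silently drop
n_missing <- sum(is.na(reshaped_demo$Prevalence))
n_missing

# Keep only rows with a prevalence value
reshaped_complete <- reshaped_demo[!is.na(reshaped_demo$Prevalence), ]
reshaped_complete
```

Whether to drop the rows or keep the empty bars is a presentation choice; keeping them shows readers that sex-specific estimates were not reported for 2000.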
<h3>
Data Visualisation (Enhanced) - Histogram (distribution of binned continuous variable)
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# Create histogram using R graphics
hist(ASD_National$Prevalence)
# Create histogram using ggplot2
ggplot(ASD_National, aes(x=Prevalence)) +
geom_histogram(binwidth = 5, fill = "blue", color = "lightgrey", alpha=0.5)
# Use color to differentiate sub-group data (Data Source)
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
geom_histogram(binwidth = 5) +
theme_bw() + theme(legend.position="right") +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue"))
# Plot sub-group data side by side, using position="dodge"
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
geom_histogram(binwidth = 5, position="dodge") +
theme_bw() + theme(legend.position="right") +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue"))
# Split plots using facet_grid()
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
geom_histogram(binwidth = 5) +
theme(legend.position="right") +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
facet_grid(facets = Source ~ .)
# Add title and caption using ggplot2
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
geom_histogram(binwidth = 5) +
theme(legend.position="top") +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
labs(x="Prevalence per 1,000 Children",
y="Frequency",
title="Distribution of Prevalence by Data Source") +
facet_grid(facets = Source ~ .)
<h3>
Data Visualisation (Enhanced) - Density plot (distribution for continuous variable normalized to 100% area under curve)
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# Create plot using R graphics
plot(density(ASD_National$Prevalence))
# Optionally, overlay histogram
hist(ASD_National$Prevalence, probability = TRUE, add = TRUE)
# Create plot using ggplot2
p <- ggplot(ASD_National) +
geom_density(aes(x=Prevalence), fill = "grey", color = "white", alpha=0.75)
p # Show
# Optionally, overlay histogram
p <- p + geom_histogram(aes(x = Prevalence, y = ..density..), binwidth = 1, fill = "blue", colour = "lightgrey", alpha=0.4)
p # Show
# Optionally, overlay Prevalence mean
p <- p + geom_vline(aes(xintercept = mean(ASD_National$Prevalence)), colour="darkorange")
p # Show
# Lastly, add other captions
p <- p + coord_cartesian(xlim=c(0, 35), ylim=c(0, 0.2)) +
labs(x="Prevalence per 1,000 Children", y="Density",
title=paste("Density of Prevalence ( mean =", mean(ASD_National$Prevalence), ")")) +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
p # Show
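The heading describes a density plot as normalized to 100% area under the curve. This can be checked numerically with a rough sketch like the one below, which uses the eight ADDM prevalence values printed earlier and a trapezoid-rule approximation (the variable names are illustrative):

```r
# Eight ADDM prevalence values shown earlier in this document
prevalence_demo <- c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8)

# Kernel density estimate with default settings (Gaussian kernel)
d <- density(prevalence_demo)

# Approximate the area under the curve with the trapezoid rule
widths  <- diff(d$x)
heights <- (head(d$y, -1) + tail(d$y, -1)) / 2
area    <- sum(widths * heights)
area  # should be close to 1
```

The area is only approximately 1 because the curve is evaluated on a finite grid that cuts off the far tails.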
< Prevalence distribution by Data Source >
# Prevalence distribution by Data Source
ggplot(ASD_National) + geom_density(aes(x = Prevalence, fill = Source), alpha = 0.5) +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
labs(x="Prevalence per 1,000 Children",
y="Density",
title="Density of Prevalence by Data Source") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
< Prevalence distribution by Data Source with split >
# Prevalence distribution by Data Source with split
ggplot(ASD_National) + geom_density(aes(x = Prevalence, fill = Source), colour = 'lightgrey', alpha = 0.75) +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
labs(x="Prevalence per 1,000 Children",
y="Density",
title="Density of Prevalence by Data Source") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey")) +
facet_wrap(~Source)
<h3>
Data Visualisation (Enhanced) - Box plot
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# Create plot using R graphics
# Create 'Prevalence' box plots break by 'Source'
boxplot(ASD_National$Prevalence ~ ASD_National$Source,
main = "National ASD Prevalence by Data Source",
xlab = "Data Source",
ylab = "Prevalence per 1,000 Children",
sub = "Year 2000 - 2016",
col.main="blue", col.lab="black", col.sub="darkgrey")
# Create box plot using ggplot2
ggplot(ASD_National, aes(x = Source, y = Prevalence, fill = Source)) +
geom_boxplot(alpha = 0.5) +
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_discrete(name = "Data Source (Year 2000 - 2016)") +
ggtitle("National ASD Prevalence by Data Source") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
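The numbers behind a box plot can be inspected directly. The sketch below applies `fivenum()` to the eight ADDM prevalence values printed earlier and derives the usual 1.5 x spread whisker limits; the variable names are illustrative. Base `boxplot()` relies on `boxplot.stats()`, which uses these hinges, while `geom_boxplot()` uses quartiles, so the two can differ slightly on small samples.

```r
# Eight ADDM prevalence values shown earlier in this document
prevalence_demo <- c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8)

# Tukey five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(prevalence_demo)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five

# Spread between the hinges, and the 1.5 x spread whisker limits
hinge_spread   <- five["upper_hinge"] - five["lower_hinge"]
whisker_limits <- c(five["lower_hinge"] - 1.5 * hinge_spread,
                    five["upper_hinge"] + 1.5 * hinge_spread)

# Values outside the whisker limits would be drawn as outlier points
outliers <- prevalence_demo[prevalence_demo < whisker_limits[1] |
                            prevalence_demo > whisker_limits[2]]
outliers
```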
<h3>
Data Visualisation (Enhanced) - Violin plot
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# Create plot using ggplot2
ggplot(ASD_National, aes(x = Source, y = Prevalence, fill = Source)) +
geom_violin(alpha = 0.5) +
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_discrete(name = "Data Source (Year 2000 - 2016)") +
ggtitle("National ASD Prevalence by Data Source") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
# Create plot using ggplot2
ggplot(ASD_National, aes(x = Source, y = Prevalence, fill = Source)) +
geom_violin(alpha = 0.5) +
geom_jitter(alpha = 0.5, position = position_jitter(width = 0.1)) + # Overlay datapoints
# coord_flip() + # Uncomment to flip x-y axis
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_discrete(name = "Data Source (Year 2000 - 2016)") +
ggtitle("National ASD Prevalence by Data Source") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
<h3>
Data Visualisation (Enhanced) - Line chart
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] REPORTED PREVALENCE HAS CHANGED OVER TIME</span>
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE HAS CHANGED OVER TIME</span> [Source: ALL]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# Build chart/plot layer by layer
# ----------------------------------
# Define a ggplot graphic object; provide data and x y for use
p <- ggplot(ASD_National, aes(x = Year, y = Prevalence))
# Show plot
p
# Select (add) line chart type:
p <- p + geom_line(aes(color = Source),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5)
# Show plot
p
# Select (add) points to chart:
p <- p + geom_point(aes(color = Source),
size=2,
shape=20,
alpha=0.5)
# Show plot
p
# Customize line color and legend name:
p <- p + scale_color_manual("Data Source:",
labels = c('ADDM', 'MEDI', 'NSCH', 'SPED'),
values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue"))
# Show plot
p
# Adjust x and y axis, scale, limit and labels:
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_continuous(name = "Year",
breaks = seq(2000, 2016, 1),
limits = c(2000, 2016))
# Show plot
p
# Customise chart title:
p <- p + ggtitle("Prevalence Estimates Over Time [ Source: ALL ]")
# Show plot
p
# Customise chart title and axis labels:
p <- p + theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
# Show plot
p
Consolidate above code into one chunk:
# ----------------------------------
# Consolidate above code into one chunk
# ----------------------------------
p <- ggplot(ASD_National, aes(x = Year, y = Prevalence)) +
geom_line(aes(color = Source),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(color = Source),
size=2,
shape=20,
alpha=0.5) +
scale_color_manual("Data Source:",
labels = c('ADDM', 'MEDI', 'NSCH', 'SPED'),
values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_continuous(name = "Year",
breaks = seq(2000, 2016, 1),
limits = c(2000, 2016)) +
ggtitle("Prevalence Estimates Over Time [ Source: ALL ]") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
# Show plot
p
Optionally, display data values/labels:
# Optionally, display data values/labels
p + geom_text(aes(label = round(Prevalence, 1)), # Values are rounded for display
vjust = "outward",
# nudge_y = 0.2, # optionally lift the text
hjust = "outward",
check_overlap = TRUE,
size = 3, # size of textual data label
col = 'darkslategrey')
<h3>
Data Visualisation (Enhanced) - Dynamic Visualisation with plotly
</h3>
if(!require(knitr)){install.packages("knitr")}
## Loading required package: knitr
library("knitr")
if(!require(plotly)){install.packages("plotly")}
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library("plotly")
Create plotly graph object from ggplot graph object:
p_dynamic <- p
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Data Visualisation (Enhanced) - Use themes as aesthetic template
</h3>
if(!require(ggthemes)){install.packages("ggthemes")}
## Loading required package: ggthemes
library('ggthemes')
Theme of the Economist magazine:
# Theme of the Economist magazine:
p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
Theme of the Wall Street Journal:
# Theme of the Wall Street Journal:
p + theme_wsj() + scale_colour_wsj("colors6")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
Dynamic chart with theme of the Economist magazine:
# Dynamic chart with theme of the Economist magazine:
p_dynamic <- p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] ADDM Network estimates for overall ASD prevalence in US over time</span> [ Source: ADDM ] over [ Year ]
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] ADDM Network estimates for overall ASD prevalence in US over time</span> [ Source: ADDM ] over [ Year ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# Filter only data of ADDM
ASD_National_ADDM <- subset(ASD_National, Source == 'addm')
# ----------------------------------
# [addm] ADDM Network estimates for overall ASD prevalence in US over time
# ----------------------------------
# Color:
# 'ADDM_Average' "purple"
p <- ggplot(ASD_National_ADDM, aes(x = Year, y = Prevalence)) +
geom_point(aes(y = Prevalence, color = 'ADDM_Average'), # Name for manual colour mapping
size=2,
shape=20,
alpha=0.95) +
# Add point for Upper.CI
geom_point(aes(y = Upper.CI, color = 'ADDM_U_CI'), # Name for manual colour mapping
size=0.1,
shape=20,
alpha=0.95) +
# Add point for Lower.CI
geom_point(aes(y = Lower.CI, color = 'ADDM_L_CI'), # Name for manual colour mapping
size=0.1,
shape=20,
alpha=0.95) +
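scale_colour_manual(name="",
breaks = c("ADDM_Average", "ADDM_U_CI", "ADDM_L_CI"), # Fix legend order so labels match (default order is alphabetical)
labels = c("US (ADDM)", "Upper CI", "Lower CI"), # Names shown in legend, in the same order as breaks
values = c(ADDM_Average="purple", ADDM_U_CI="red", ADDM_L_CI="red")) # Manual colour mapping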
# Add title, axis label, and axis scale
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 18, 2),
limits=c(0, 18)) +
scale_x_continuous(name = "Year",
breaks = seq(2000, 2014, 2),
limits = c(2000, 2014)) +
ggtitle("ADDM Network estimates for overall ASD prevalence in US over time\nwith confidence interval") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"),
panel.background = element_blank(), # Remove chart background colour
legend.position = 'top',
panel.grid.major = element_line(size = 0.2, linetype = 'solid', colour = "lightgrey") # grid colour et al
)
# Show plot
p
# Add smooth curve to go through data points, using interpolation with splines:
# https://stackoverflow.com/questions/35205795/plotting-smooth-line-through-all-data-points
spline_ADDM_Prevalence <- as.data.frame(spline(ASD_National_ADDM$Year, ASD_National_ADDM$Prevalence))
spline_ADDM_Prevalence_U_CI <- as.data.frame(spline(ASD_National_ADDM$Year, ASD_National_ADDM$Upper.CI))
spline_ADDM_Prevalence_L_CI <- as.data.frame(spline(ASD_National_ADDM$Year, ASD_National_ADDM$Lower.CI))
# Show plot
p + geom_line(data = spline_ADDM_Prevalence, aes(x = x, y = y, color = 'ADDM_Average'), linetype = "solid", size=0.6) +
geom_line(data = spline_ADDM_Prevalence_U_CI, aes(x = x, y = y, color = 'ADDM_U_CI'), linetype = 2, size=0.3) +
geom_line(data = spline_ADDM_Prevalence_L_CI, aes(x = x, y = y, color = 'ADDM_L_CI'), linetype = 2, size=0.3)
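To see what `spline()` contributes to the chart above, a minimal sketch on invented numbers (not the ADDM data) may help. The names `toy_year`, `toy_prev`, `toy_curve` and `toy_at_knots` are illustrative only.

```r
# Toy series: invented values, not taken from the ADDM dataset
toy_year <- c(2000, 2002, 2004, 2006, 2008)
toy_prev <- c(2, 3, 5, 4, 7)

# By default spline() evaluates a cubic spline at n = 3 * length(x)
# equally spaced x values and returns a list with components x and y
toy_curve <- as.data.frame(spline(toy_year, toy_prev))
nrow(toy_curve)

# Evaluated at the original x values, the curve reproduces the inputs:
# it interpolates through the points rather than smoothing them
toy_at_knots <- spline(toy_year, toy_prev, xout = toy_year)
all.equal(toy_at_knots$y, toy_prev)
```

Because the curve passes through every observation, it is a drawing aid rather than a fitted trend; values between survey years are not estimates.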
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] over [ Year ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# [addm] < Prevalence Varies by Sex >
# ----------------------------------
# Color:
# 'ADDM_Average' "darkslategrey"
# 'Female_Prevalence' "orange"
# 'Male_Prevalence' "blue"
p <- ggplot(ASD_National_ADDM, aes(x = Year, y = Prevalence)) +
geom_line(aes(y = Prevalence, colour = 'ADDM_Average'),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Prevalence, color = 'ADDM_Average'),
size=2,
shape=20,
alpha=0.5) +
# Add line for Female
geom_line(aes(y = Female.Prevalence, colour = 'Female_Prevalence'),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Female.Prevalence, color = 'Female_Prevalence'),
size=2,
shape=20,
alpha=0.5) +
# Add line for Male
geom_line(aes(y = Male.Prevalence, colour = 'Male_Prevalence'),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Male.Prevalence, color = 'Male_Prevalence'),
size=2,
shape=20,
alpha=0.5) +
scale_colour_manual(name="",
labels = c("ADDM Average", "Female Prevalence", "Male Prevalence"),
values = c(ADDM_Average="darkslategrey", Female_Prevalence="orange", Male_Prevalence="blue"))
# Add title, axis label, and axis scale
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_continuous(name = "Year",
breaks = seq(2000, 2016, 1),
limits = c(2000, 2016)) +
ggtitle("Prevalence Estimates by Sex [ Source: ADDM ]") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
# Show plot
p
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
# Apply theme
p + theme_economist() + scale_colour_economist() # p + theme_wsj() + scale_colour_wsj("colors6")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
# Dynamic chart:
p_dynamic <- p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Quiz:
</h3>
<p>
Add 95% Confidence Interval to above plot (Use ggplot)
</p>
# Write your code below and press Shift+Enter to execute
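One possible approach to this quiz, shown as a sketch rather than the official solution, is to draw each interval as a shaded band with `geom_ribbon()`. The data frame `toy` holds invented values, and the sex-specific interval columns (`Male.Lower.CI`, `Male.Upper.CI`, `Female.Lower.CI`, `Female.Upper.CI`) are assumed to exist in `ASD_National_ADDM` under the same names used in the State-level dataset.

```r
library(ggplot2)

# Invented values for illustration; column names are an assumption
toy <- data.frame(
  Year              = c(2010, 2012, 2014),
  Male.Prevalence   = c(20, 22, 25),
  Male.Lower.CI     = c(19, 21, 24),
  Male.Upper.CI     = c(21, 23, 26),
  Female.Prevalence = c(4, 5, 6),
  Female.Lower.CI   = c(3.5, 4.5, 5.5),
  Female.Upper.CI   = c(4.5, 5.5, 6.5)
)

sex_colours <- c(Female_Prevalence = "orange", Male_Prevalence = "blue")

# One ribbon (the interval) and one line (the estimate) per sex
p_ci <- ggplot(toy, aes(x = Year)) +
  geom_ribbon(aes(ymin = Male.Lower.CI, ymax = Male.Upper.CI,
                  fill = "Male_Prevalence"), alpha = 0.2) +
  geom_line(aes(y = Male.Prevalence, colour = "Male_Prevalence")) +
  geom_ribbon(aes(ymin = Female.Lower.CI, ymax = Female.Upper.CI,
                  fill = "Female_Prevalence"), alpha = 0.2) +
  geom_line(aes(y = Female.Prevalence, colour = "Female_Prevalence")) +
  scale_colour_manual(name = "", values = sex_colours) +
  scale_fill_manual(name = "", values = sex_colours)
```

Swapping `toy` for `ASD_National_ADDM` should give the real chart, provided the assumed columns are present; `geom_errorbar()` with the same `ymin`/`ymax` mapping is an alternative when discrete bars read better than bands.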
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] REPORTED PREVALENCE VARIES BY RACE AND ETHNICITY</span>
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY RACE AND ETHNICITY</span> [ Source: ADDM ] With Average
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# [addm] < Prevalence Varies by Race and Ethnicity >
# ----------------------------------
# Color:
# 'ADDM_Average' "darkslategrey"
# 'Asian_Pacific_Islander' "darkred"
# 'Hispanic' "darkorchid3"
# 'Non_Hispanic_Black' "deepskyblue3"
# 'Non_Hispanic_White' "chartreuse3"
p <- ggplot(ASD_National_ADDM, aes(x = Year, y = Prevalence)) +
geom_line(aes(y = Prevalence, colour = 'ADDM_Average'),
linetype = "dotted", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Prevalence, color = 'ADDM_Average'),
size=2,
shape=20,
alpha=0) +
# Add line for Asian.or.Pacific.Islander.Prevalence
geom_line(aes(y = Asian.or.Pacific.Islander.Prevalence, colour = 'Asian_Pacific_Islander'),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Asian.or.Pacific.Islander.Prevalence, colour = 'Asian_Pacific_Islander'),
size=2,
shape=20,
alpha=0.5) +
# Add line for Hispanic.Prevalence
geom_line(aes(y = Hispanic.Prevalence, colour = 'Hispanic'),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Hispanic.Prevalence, colour = 'Hispanic'),
size=2,
shape=20,
alpha=0.5) +
# Add line for Non.hispanic.black.Prevalence
geom_line(aes(y = Non.hispanic.black.Prevalence, colour = 'Non_Hispanic_Black'),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Non.hispanic.black.Prevalence, colour = 'Non_Hispanic_Black'),
size=2,
shape=20,
alpha=0.5) +
# Add line for Non.hispanic.white.Prevalence
geom_line(aes(y = Non.hispanic.white.Prevalence, colour = 'Non_Hispanic_White'),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5) +
geom_point(aes(y = Non.hispanic.white.Prevalence, colour = 'Non_Hispanic_White'),
size=2,
shape=20,
alpha=0.5) +
scale_colour_manual(name="",
labels = c("ADDM Average",
"Asian/Pacific Islander",
"Hispanic",
"Non-Hispanic Black",
"Non-Hispanic White"),
values = c(ADDM_Average="darkslategrey",
Asian_Pacific_Islander ="darkred",
Hispanic ="darkorchid3",
Non_Hispanic_Black ="deepskyblue3",
Non_Hispanic_White ="chartreuse3"))
# Add title, axis label, and axis scale
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(5, 20, 5),
limits=c(5, 20)) +
scale_x_continuous(name = "Year",
breaks = seq(2000, 2016, 1),
limits = c(2000, 2016)) +
ggtitle("Prevalence Estimates by Race/Ethnicity [ Source: ADDM ]") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
# Show plot
p
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
# Apply theme
# p + theme_economist() + scale_colour_economist() # p + theme_wsj() + scale_colour_wsj("colors6")
# Dynamic chart:
p_dynamic <- p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Quiz:
</h3>
<p>
Change above zig-zag lines to spline/smooth lines.
</p>
<p>
Hints: Refer to <span style="color:blue">ADDM Network estimates for overall ASD prevalence in US over time</span>.
</p>
# Write your code below and press Shift+Enter to execute
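A sketch of one way to approach this quiz: the race and ethnicity series contain missing years, so it helps to drop incomplete pairs before interpolating. The helper `spline_df()` is hypothetical (not part of the original notebook), and the input values are invented.

```r
# Hypothetical helper: spline-interpolate one series, skipping missing values
spline_df <- function(x, y, n = 50) {
  ok <- complete.cases(x, y)   # keep only pairs where both values are present
  as.data.frame(spline(x[ok], y[ok], n = n))
}

# Toy input with gaps at the start (invented values)
toy_year <- c(2000, 2002, 2004, 2006, 2008, 2010)
toy_hisp <- c(NA, NA, 4, 5, 7, 8)

toy_smooth <- spline_df(toy_year, toy_hisp)
range(toy_smooth$x)   # the curve only spans the years that have data
```

Each smoothed frame could then replace the corresponding `geom_line()` layer, for example `geom_line(data = toy_smooth, aes(x = x, y = y, colour = 'Hispanic'))`, while the `geom_point()` layers keep showing the observed values.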
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">US. State Level Data Processing</span>
</h3>
# ----------------------------------
# Dataset: US. State Level Children ASD Prevalence
# ----------------------------------
ASD_State <- read.csv("../dataset/ADV_ASD_State.csv", stringsAsFactors = FALSE)
# Obtain number of rows and number of columns/features/variables
dim(ASD_State)
## [1] 1692 49
# Obtain overview (data structure/types)
str(ASD_State)
## 'data.frame': 1692 obs. of 49 variables:
## $ State : chr "AZ" "GA" "MD" "NJ" ...
## $ Denominator : int 45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
## $ Prevalence : num 6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
## $ Lower.CI : num 5.8 5.8 4.6 8.9 5.4 3.7 2.7 5.5 6.1 4.6 ...
## $ Upper.CI : num 7.3 7.3 6.6 11.1 7.4 5.5 3.9 7 7.8 7.5 ...
## $ Year : int 2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
## $ Source : chr "addm" "addm" "addm" "addm" ...
## $ Source_Full1 : chr "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
## $ State_Full1 : chr "Arizona" "Georgia" "Maryland" "New Jersey" ...
## $ State_Full2 : chr "AZ-Arizona" "GA-Georgia" "MD-Maryland" "NJ-New Jersey" ...
## $ Numerator_ASD : int 295 283 118 294 155 104 117 280 252 65 ...
## $ Numerator_NonASD : int 45027 43310 21414 29420 24380 22961 35355 44833 36220 10955 ...
## $ Proportion : num 0.00651 0.00649 0.00548 0.00989 0.00632 ...
## $ X95_Z_CI : num 0.00074 0.000754 0.000986 0.001125 0.000991 ...
## $ Z_Lower.CI : num 5.77 5.74 4.49 8.77 5.33 ...
## $ Z_Upper.CI : num 7.25 7.25 6.47 11.02 7.31 ...
## $ Z_Lower.CI_ABSerror : num 0.0314 0.062 0.1059 0.1311 0.0739 ...
## $ Z_Upper.CI_ABSerror : num 0.0507 0.0542 0.1337 0.0803 0.0911 ...
## $ Chi_Wilson_P : num 0.00655 0.00654 0.00557 0.00996 0.00639 ...
## $ X95_Chi_Wilson_CI : num 0.000741 0.000755 0.00099 0.001127 0.000994 ...
## $ Chi_Wilson_Lower.CI : num 5.81 5.78 4.58 8.83 5.4 ...
## $ Chi_Wilson_Upper.CI : num 7.29 7.29 6.56 11.08 7.39 ...
## $ Chi_Wilson_Lower.CI_ABSerror : num 0.009314 0.019761 0.021503 0.069416 0.000453 ...
## $ Chi_Wilson_Upper.CI_ABSerror : num 0.0077 0.00953 0.04165 0.01523 0.01087 ...
## $ Chi_Wilson_Corrected_w_minus.CI : num 0.0058 0.00577 0.00456 0.00881 0.00538 ...
## $ Chi_Wilson_Corrected_w_plus.CI : num 0.0073 0.0073 0.00658 0.0111 0.00741 ...
## $ Chi_Wilson_Corrected_Lower.CI : num 5.8 5.77 4.56 8.81 5.38 ...
## $ Chi_Wilson_Corrected_Upper.CI : num 7.3 7.3 6.58 11.1 7.41 ...
## $ Chi_Wilson_Corrected_Lower.CI_ABSerror: num 0.00109 0.03057 0.04265 0.08529 0.01834 ...
## $ Chi_Wilson_Corrected_Upper.CI_ABSerror: num 0.00395 0.0026 0.01636 0.00254 0.01108 ...
## $ Male.Prevalence : num 9.7 11 8.6 14.8 9.3 6.6 5 10.1 10.7 9.9 ...
## $ Male.Lower.CI : num 8.5 9.7 7.1 13 7.8 5.2 4.1 8.8 9.3 7.6 ...
## $ Male.Upper.CI : num 11.1 12.4 10.6 16.8 11.2 8.2 6.2 11.4 12.3 12.9 ...
## $ Female.Prevalence : num 3.2 2 2.2 4.3 3.3 2.4 1.4 2.2 2.9 1.7 ...
## $ Female.Lower.CI : num 2.5 1.5 1.5 3.3 2.4 1.6 0.9 1.7 2.2 0.9 ...
## $ Female.Upper.CI : num 4 2.7 2.7 5.5 4.5 3.5 2.1 2.9 3.8 3.2 ...
## $ Non.hispanic.white.Prevalence : num 8.6 7.9 4.9 11.3 6.5 4.5 3.3 7.7 7.4 6.4 ...
## $ Non.hispanic.white.Lower.CI : num 7.5 6.7 3.8 9.5 5.2 3.7 2.6 6.7 6.5 4.8 ...
## $ Non.hispanic.white.Upper.CI : num 9.8 9.3 6.4 13.3 8.2 5.5 4.1 8.9 8.6 8.5 ...
## $ Non.hispanic.black.Prevalence : chr "7.3" "5.3" "6.1" "10.6" ...
## $ Non.hispanic.black.Lower.CI : chr "4.4" "4.4" "4.7" "8.5" ...
## $ Non.hispanic.black.Upper.CI : chr "12.2" "6.4" "8" "13.1" ...
## $ Hispanic.Prevalence : chr "No data" "No data" "No data" "No data" ...
## $ Hispanic.Lower.CI : chr "No data" "No data" "No data" "No data" ...
## $ Hispanic.Upper.CI : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Prevalence : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Lower.CI : chr "No data" "No data" "No data" "No data" ...
## $ Asian.or.Pacific.Islander.Upper.CI : chr "No data" "No data" "No data" "No data" ...
## $ State_Region : chr "D8 Mountain" "D5 South Atlantic" "D5 South Atlantic" "D2 Middle Atlantic" ...
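The `Chi_Wilson_*` column names suggest a Wilson score interval for a binomial proportion. That reading is an inference from the names and the numbers, not something the dataset documents. As a check, the sketch below recomputes the interval for the first row shown above (Arizona, 2000: 295 cases among 45,322 children); `wilson_ci()` is a hypothetical helper written for this illustration.

```r
# Wilson score interval for a binomial proportion (no continuity correction)
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half, upper = centre + half)
}

# First row of the str() output: Numerator_ASD = 295, Denominator = 45322
az_ci <- wilson_ci(x = 295, n = 45322) * 1000   # scale to per 1,000 children
round(az_ci, 2)

# prop.test() without continuity correction reports the same kind of interval
az_pt <- prop.test(295, 45322, correct = FALSE)$conf.int * 1000
```

Rounded to two decimals, the result agrees with the `Chi_Wilson_Lower.CI` and `Chi_Wilson_Upper.CI` values printed for that row (5.81 and 7.29).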
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">US. State Level Data</span> Pre-Process data
</h3>
Pre-Process data: Missing data
# Load required function from packages:
if(!require(naniar)){install.packages("naniar")}
## Loading required package: naniar
library(naniar)
if(!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dplyr)
# Count missing values in dataframe:
sum(is.na(ASD_State)) # missing data recognised by R (NA)
## [1] 14454
# Define several offending strings
na_strings <- c("", "No data", "NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")
# Replace these defined missing-value strings with R's internal NA
ASD_State = replace_with_na_all(ASD_State, condition = ~.x %in% na_strings)
# Count missing values in dataframe:
sum(is.na(ASD_State))
## [1] 28992
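The count rises from 14454 to 28992 because placeholder strings such as "No data" are ordinary text to R until they are replaced. The sketch below shows the same idea in base R on an invented data frame; it is meant as an illustration of what the replacement step does, not as a substitute for `replace_with_na_all()`.

```r
# Toy data frame with invented values
toy <- data.frame(
  State      = c("AZ", "GA", "MD"),
  Hispanic   = c("No data", "7.1", ""),
  Prevalence = c(6.5, NA, 5.5),
  stringsAsFactors = FALSE
)
na_strings <- c("", "No data", "N/A")

na_before <- sum(is.na(toy))   # counts only the NA that R already recognises

# Blank out any cell whose value matches one of the placeholder strings
toy_clean <- toy
toy_clean[] <- lapply(toy_clean, function(col) {
  col[col %in% na_strings] <- NA
  col
})

na_after <- sum(is.na(toy_clean))
c(before = na_before, after = na_after)
```

Assigning through `toy_clean[]` keeps the object a plain data frame and leaves column types unchanged, so numeric columns stay numeric.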
Remove invalid unicode char/string: \x92
# Remove invalid unicode char/string: \x92
ASD_State$Source_Full1[ASD_State$Source_Full1 == "National Survey of Children\x92s Health"] <- "National Survey of Children's Health"
Delete/Drop variables by index: columns 14 to 26, 29, and 30
cbind(names(ASD_State), c(1:length(names(ASD_State))))
## [,1] [,2]
## [1,] "State" "1"
## [2,] "Denominator" "2"
## [3,] "Prevalence" "3"
## [4,] "Lower.CI" "4"
## [5,] "Upper.CI" "5"
## [6,] "Year" "6"
## [7,] "Source" "7"
## [8,] "Source_Full1" "8"
## [9,] "State_Full1" "9"
## [10,] "State_Full2" "10"
## [11,] "Numerator_ASD" "11"
## [12,] "Numerator_NonASD" "12"
## [13,] "Proportion" "13"
## [14,] "X95_Z_CI" "14"
## [15,] "Z_Lower.CI" "15"
## [16,] "Z_Upper.CI" "16"
## [17,] "Z_Lower.CI_ABSerror" "17"
## [18,] "Z_Upper.CI_ABSerror" "18"
## [19,] "Chi_Wilson_P" "19"
## [20,] "X95_Chi_Wilson_CI" "20"
## [21,] "Chi_Wilson_Lower.CI" "21"
## [22,] "Chi_Wilson_Upper.CI" "22"
## [23,] "Chi_Wilson_Lower.CI_ABSerror" "23"
## [24,] "Chi_Wilson_Upper.CI_ABSerror" "24"
## [25,] "Chi_Wilson_Corrected_w_minus.CI" "25"
## [26,] "Chi_Wilson_Corrected_w_plus.CI" "26"
## [27,] "Chi_Wilson_Corrected_Lower.CI" "27"
## [28,] "Chi_Wilson_Corrected_Upper.CI" "28"
## [29,] "Chi_Wilson_Corrected_Lower.CI_ABSerror" "29"
## [30,] "Chi_Wilson_Corrected_Upper.CI_ABSerror" "30"
## [31,] "Male.Prevalence" "31"
## [32,] "Male.Lower.CI" "32"
## [33,] "Male.Upper.CI" "33"
## [34,] "Female.Prevalence" "34"
## [35,] "Female.Lower.CI" "35"
## [36,] "Female.Upper.CI" "36"
## [37,] "Non.hispanic.white.Prevalence" "37"
## [38,] "Non.hispanic.white.Lower.CI" "38"
## [39,] "Non.hispanic.white.Upper.CI" "39"
## [40,] "Non.hispanic.black.Prevalence" "40"
## [41,] "Non.hispanic.black.Lower.CI" "41"
## [42,] "Non.hispanic.black.Upper.CI" "42"
## [43,] "Hispanic.Prevalence" "43"
## [44,] "Hispanic.Lower.CI" "44"
## [45,] "Hispanic.Upper.CI" "45"
## [46,] "Asian.or.Pacific.Islander.Prevalence" "46"
## [47,] "Asian.or.Pacific.Islander.Lower.CI" "47"
## [48,] "Asian.or.Pacific.Islander.Upper.CI" "48"
## [49,] "State_Region" "49"
# Delete/Drop variable by index: column from 14 to 26, 29, and 30
# names(ASD_State)
ASD_State <- ASD_State[ -c(14:26, 29, 30) ]
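Dropping by position works as long as the column order never changes. As a hedged alternative, the sketch below drops columns by name pattern on an invented one-row data frame; the pattern `^(Z_|X95_)` is an example and would need extending to cover every column removed above.

```r
# Toy data frame reusing a few column names from the listing above
toy <- data.frame(
  State = "AZ", Prevalence = 6.5,
  X95_Z_CI = 0.00074, Z_Lower.CI = 5.77, Z_Upper.CI = 7.25,
  Chi_Wilson_Corrected_Lower.CI = 5.8
)

# Collect every column whose name starts with "Z_" or "X95_"
drop_cols <- grep("^(Z_|X95_)", names(toy), value = TRUE)

# Keep everything else; drop = FALSE guards against collapsing to a vector
toy_small <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_small)
```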
Create new variables
# Create one new variable: Source_UC as uppercase of Source
ASD_State$Source_UC <- paste(toupper(ASD_State$Source))
# Create one new variable: Source_Full3 by combining Source_UC and Source_Full1
ASD_State$Source_Full3 <- paste(ASD_State$Source_UC, ASD_State$Source_Full1)
Create one new ordinal categorical variable: Prevalence_Risk2 (“Low”, “High”) by binning Prevalence
# Recode Risk into category from Prevalence
# Low [0, 5)
# High [5, +oo)
ASD_State$Prevalence_Risk2[ASD_State$Prevalence < 5] = "Low"
## Warning: Unknown or uninitialised column: 'Prevalence_Risk2'.
ASD_State$Prevalence_Risk2[ASD_State$Prevalence >= 5 ] = "High"
#
# head(ASD_State)
Create one new ordinal categorical variable: Prevalence_Risk4 (“Low”, “Medium”, “High”, “Very High”) by binning Prevalence
# Recode Risk into category from Prevalence
# Low [0, 5)
# Medium [5, 10)
# High [10, 20)
# Very High [20, +oo)
ASD_State$Prevalence_Risk4 = "Very High"
ASD_State$Prevalence_Risk4[ASD_State$Prevalence < 20 ] = "High"
ASD_State$Prevalence_Risk4[ASD_State$Prevalence < 10 ] = "Medium"
ASD_State$Prevalence_Risk4[ASD_State$Prevalence < 5] = "Low"
#
# head(ASD_State)
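The same binning can be written in one step with base R's `cut()`. This is offered as a sketch of an alternative, not as the notebook's method; `toy_prev` and `toy_risk4` are invented names and values.

```r
# Invented prevalence values chosen to sit on and around the bin edges
toy_prev <- c(4.9, 5, 9.99, 10, 20, 25, NA)

toy_risk4 <- cut(toy_prev,
                 breaks = c(0, 5, 10, 20, Inf),
                 labels = c("Low", "Medium", "High", "Very High"),
                 right = FALSE,           # left-closed bins: [0, 5), [5, 10), [10, 20), [20, Inf)
                 ordered_result = TRUE)   # return an ordered factor directly
toy_risk4
```

With `right = FALSE` the bins match the intervals listed in the comments above, a missing prevalence stays `NA`, and the result is already an ordered factor, so no separate conversion step is needed.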
Convert to correct data types
str(ASD_State)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1692 obs. of 38 variables:
## $ State : chr "AZ" "GA" "MD" "NJ" ...
## $ Denominator : int 45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
## $ Prevalence : num 6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
## $ Lower.CI : num 5.8 5.8 4.6 8.9 5.4 3.7 2.7 5.5 6.1 4.6 ...
## $ Upper.CI : num 7.3 7.3 6.6 11.1 7.4 5.5 3.9 7 7.8 7.5 ...
## $ Year : int 2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
## $ Source : chr "addm" "addm" "addm" "addm" ...
## $ Source_Full1 : chr "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
## $ State_Full1 : chr "Arizona" "Georgia" "Maryland" "New Jersey" ...
## $ State_Full2 : chr "AZ-Arizona" "GA-Georgia" "MD-Maryland" "NJ-New Jersey" ...
## $ Numerator_ASD : int 295 283 118 294 155 104 117 280 252 65 ...
## $ Numerator_NonASD : int 45027 43310 21414 29420 24380 22961 35355 44833 36220 10955 ...
## $ Proportion : num 0.00651 0.00649 0.00548 0.00989 0.00632 ...
## $ Chi_Wilson_Corrected_Lower.CI : num 5.8 5.77 4.56 8.81 5.38 ...
## $ Chi_Wilson_Corrected_Upper.CI : num 7.3 7.3 6.58 11.1 7.41 ...
## $ Male.Prevalence : num 9.7 11 8.6 14.8 9.3 6.6 5 10.1 10.7 9.9 ...
## $ Male.Lower.CI : num 8.5 9.7 7.1 13 7.8 5.2 4.1 8.8 9.3 7.6 ...
## $ Male.Upper.CI : num 11.1 12.4 10.6 16.8 11.2 8.2 6.2 11.4 12.3 12.9 ...
## $ Female.Prevalence : num 3.2 2 2.2 4.3 3.3 2.4 1.4 2.2 2.9 1.7 ...
## $ Female.Lower.CI : num 2.5 1.5 1.5 3.3 2.4 1.6 0.9 1.7 2.2 0.9 ...
## $ Female.Upper.CI : num 4 2.7 2.7 5.5 4.5 3.5 2.1 2.9 3.8 3.2 ...
## $ Non.hispanic.white.Prevalence : num 8.6 7.9 4.9 11.3 6.5 4.5 3.3 7.7 7.4 6.4 ...
## $ Non.hispanic.white.Lower.CI : num 7.5 6.7 3.8 9.5 5.2 3.7 2.6 6.7 6.5 4.8 ...
## $ Non.hispanic.white.Upper.CI : num 9.8 9.3 6.4 13.3 8.2 5.5 4.1 8.9 8.6 8.5 ...
## $ Non.hispanic.black.Prevalence : chr "7.3" "5.3" "6.1" "10.6" ...
## $ Non.hispanic.black.Lower.CI : chr "4.4" "4.4" "4.7" "8.5" ...
## $ Non.hispanic.black.Upper.CI : chr "12.2" "6.4" "8" "13.1" ...
## $ Hispanic.Prevalence : chr NA NA NA NA ...
## $ Hispanic.Lower.CI : chr NA NA NA NA ...
## $ Hispanic.Upper.CI : chr NA NA NA NA ...
## $ Asian.or.Pacific.Islander.Prevalence: chr NA NA NA NA ...
## $ Asian.or.Pacific.Islander.Lower.CI : chr NA NA NA NA ...
## $ Asian.or.Pacific.Islander.Upper.CI : chr NA NA NA NA ...
## $ State_Region : chr "D8 Mountain" "D5 South Atlantic" "D5 South Atlantic" "D2 Middle Atlantic" ...
## $ Source_UC : chr "ADDM" "ADDM" "ADDM" "ADDM" ...
## $ Source_Full3 : chr "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" ...
## $ Prevalence_Risk2 : chr "High" "High" "High" "High" ...
## $ Prevalence_Risk4 : chr "Medium" "Medium" "Medium" "Medium" ...
# cbind(names(ASD_State), c(1:length(names(ASD_State))))
Convert variables to numeric
# Convert Prevalence and CIs from categorical/chr to numeric
ix <- 13:33 # define an index
ASD_State[ix] <- lapply(ASD_State[ix], as.numeric)
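`as.numeric()` turns any text it cannot parse into `NA` (with a warning), which is why the placeholder strings were replaced first. A small pre-conversion check, sketched here on an invented vector, lists values that are present but would be lost.

```r
# Invented character column with one leftover placeholder
toy_chr <- c("7.3", "No data", NA, "10.6")

# TRUE where a value exists but does not survive conversion to numeric
bad <- !is.na(toy_chr) & is.na(suppressWarnings(as.numeric(toy_chr)))
toy_chr[bad]
```

An empty result means the conversion will only produce `NA` where the data were already missing.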
Convert variables to categorical/factor
# Convert State, Source, and the other text columns from chr to categorical/factor
ix <- c(1, 7, 8, 9, 10, 34, 35, 36) # define an index
ASD_State[ix] <- lapply(ASD_State[ix], as.factor)
# Create new ordered factor Year_Factor from Year
ASD_State$Year_Factor <- factor(ASD_State$Year, ordered = TRUE)
Convert Prevalence_Risk2 & Prevalence_Risk4 to ordered factor
# Convert to factor
ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE,
levels=c("Low", "High"))
# Convert to factor
ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE,
levels=c("Low", "Medium", "High", "Very High"))
# Display unique values (levels) of each factor variable
lapply(select_if(ASD_State, is.factor), levels)
## $State
## [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
## [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
## [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VA" "VT" "WA" "WI" "WV" "WY"
##
## $Source
## [1] "addm" "medi" "nsch" "sped"
##
## $Source_Full1
## [1] "Autism & Developmental Disabilities Monitoring Network"
## [2] "Medicaid"
## [3] "National Survey of Children's Health"
## [4] "Special Education Child Count"
##
## $State_Full1
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
##
## $State_Full2
## [1] "AK-Alaska" "AL-Alabama"
## [3] "AR-Arkansas" "AZ-Arizona"
## [5] "CA-California" "CO-Colorado"
## [7] "CT-Connecticut" "DC-District of Columbia"
## [9] "DE-Delaware" "FL-Florida"
## [11] "GA-Georgia" "HI-Hawaii"
## [13] "IA-Iowa" "ID-Idaho"
## [15] "IL-Illinois" "IN-Indiana"
## [17] "KS-Kansas" "KY-Kentucky"
## [19] "LA-Louisiana" "MA-Massachusetts"
## [21] "MD-Maryland" "ME-Maine"
## [23] "MI-Michigan" "MN-Minnesota"
## [25] "MO-Missouri" "MS-Mississippi"
## [27] "MT-Montana" "NC-North Carolina"
## [29] "ND-North Dakota" "NE-Nebraska"
## [31] "NH-New Hampshire" "NJ-New Jersey"
## [33] "NM-New Mexico" "NV-Nevada"
## [35] "NY-New York" "OH-Ohio"
## [37] "OK-Oklahoma" "OR-Oregon"
## [39] "PA-Pennsylvania" "RI-Rhode Island"
## [41] "SC-South Carolina" "SD-South Dakota"
## [43] "TN-Tennessee" "TX-Texas"
## [45] "UT-Utah" "VA-Virginia"
## [47] "VT-Vermont" "WA-Washington"
## [49] "WI-Wisconsin" "WV-West Virginia"
## [51] "WY-Wyoming"
##
## $State_Region
## [1] "D1 New England" "D2 Middle Atlantic" "D3 East North Central"
## [4] "D4 West North Central" "D5 South Atlantic" "D6 East South Central"
## [7] "D7 West South Central" "D8 Mountain" "D9 Pacific"
##
## $Source_UC
## [1] "ADDM" "MEDI" "NSCH" "SPED"
##
## $Source_Full3
## [1] "ADDM Autism & Developmental Disabilities Monitoring Network"
## [2] "MEDI Medicaid"
## [3] "NSCH National Survey of Children's Health"
## [4] "SPED Special Education Child Count"
##
## $Prevalence_Risk2
## [1] "Low" "High"
##
## $Prevalence_Risk4
## [1] "Low" "Medium" "High" "Very High"
##
## $Year_Factor
## [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016"
Optionally, export the processed dataframe to a CSV file.
write.csv(ASD_State, file = "../dataset/ADV_ASD_State_R.csv", row.names = FALSE)
# Read back in above saved file:
# ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
# ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
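The commented lines above re-create the ordered factors because a CSV file stores only text labels. The sketch below, on an invented two-row data frame written to temporary files, contrasts that with base R's `saveRDS()`/`readRDS()`, which keep the R types; it is an optional alternative, not part of the original workflow.

```r
# Invented data with one ordered factor
toy <- data.frame(Prevalence = c(3, 12))
toy$Risk <- factor(c("Low", "High"), levels = c("Low", "High"), ordered = TRUE)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(toy, csv_path, row.names = FALSE)
saveRDS(toy, rds_path)

from_csv <- read.csv(csv_path)   # level order and ordering flag are not stored
from_rds <- readRDS(rds_path)    # object is restored as it was saved

c(csv = is.ordered(from_csv$Risk), rds = is.ordered(from_rds$Risk))
```

CSV remains the better choice for sharing with other tools; RDS is convenient when the file is only read back into R.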
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">US. State Level Data Visualisation</span>
</h3>
<h3>
<span style="color:blue">The above chart shows data availability at the data source level; we would also like to know data availability at the State level. How?</span>
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Explore the Data</span> [ Years Data Available by State ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=12)
# ----------------------------------
# [State] < Years Data Available by State >
# ----------------------------------
p <- ggplot(ASD_State, aes(x = Source, fill = Source)) +
geom_bar() + theme(axis.text.x=element_blank(), # Hide axis
axis.ticks.x=element_blank(), # Hide axis
axis.text.y=element_blank(), # Hide axis
axis.ticks.y=element_blank(), # Hide axis
panel.background = element_blank(), # Remove panel background
legend.position="top",
strip.text.y = element_text(angle=0) # Rotate text to horizontal
) +
scale_fill_manual("Data Source:", values = c("addm" = "darkblue",
"medi" = "orange",
"nsch" = "darkred",
"sped" = "skyblue")) +
facet_grid(facets = State_Full2 ~ Year) +
labs(x="", y="", title="Years Data Available by State") # layers of graphics
# The plot below may take a while to render
# Show plot
p
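A cross-tabulation gives a quick numeric companion to the facet grid. The sketch below uses an invented data frame with the same `State`, `Year` and `Source` column names; the counts are illustrative only.

```r
# Invented records: one row per State/Year/Source estimate
toy <- data.frame(
  State  = c("AZ", "AZ", "GA", "GA", "GA"),
  Year   = c(2000, 2002, 2000, 2000, 2004),
  Source = c("addm", "addm", "addm", "sped", "sped")
)

# Number of records per State and Year; 0 marks a gap in coverage
toy_avail <- with(toy, table(State, Year))
toy_avail

# Number of distinct years with any data, per State
toy_years <- rowSums(toy_avail > 0)
toy_years
```

Applied to `ASD_State`, the same two calls would list, for each State, how many years have at least one estimate from any source.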
Filter and create dataframe of different data sources, for easy data access
# Filter and create dataframe of different data sources, for easy data access
ASD_State_ADDM <- subset(ASD_State, Source == 'addm')
ASD_State_MEDI <- subset(ASD_State, Source == 'medi')
ASD_State_NSCH <- subset(ASD_State, Source == 'nsch')
ASD_State_SPED <- subset(ASD_State, Source == 'sped')
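When the per-source frames are needed in a loop, `split()` builds all of them at once as a named list. This is a sketch on invented values; the four `subset()` calls above remain perfectly valid.

```r
# Invented records from three sources
toy <- data.frame(
  Source     = c("addm", "sped", "addm", "medi"),
  Prevalence = c(6.5, 3.1, 5.5, 4.2),
  stringsAsFactors = FALSE
)

# One data frame per distinct Source value, named after that value
toy_by_source <- split(toy, toy$Source)
names(toy_by_source)
nrow(toy_by_source$addm)
```

A list like this pairs naturally with `lapply()`, for example to draw the same chart once per data source.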
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Explore the Data</span> Years Data Available by State [ Source: ADDM ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
Years Data Available by State [ Source: ADDM ]
# Years Data Available by State [ Source: ADDM ]
p <- ggplot(ASD_State_ADDM, aes(x = 1, fill = State_Full2)) +
geom_bar() + theme(axis.text.x=element_blank(), # Hide axis
axis.ticks.x=element_blank(), # Hide axis
axis.text.y=element_blank(), # Hide axis
axis.ticks.y=element_blank(), # Hide axis
panel.background = element_blank(), # Remove panel background
legend.position="none",
strip.text.y = element_text(angle=0) # Rotate text to horizontal
) +
facet_grid(facets = State_Full2 ~ Year_Factor) +
labs(x="", y="", title="Years Data Available by State [ Source: ADDM ]") # layers of graphics
# Show plot
p
<h3>
Quiz:
</h3>
<p>
Create <span style="color:blue">Years Data Available by State [ Source: XXXX ]</span> for other three data sources:
</p>
# Write your code below and press Shift+Enter to execute
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION (States)</span> Prevalence Estimates by State [ Source: ADDM ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
Visualise: Prevalence Estimates by State [ Source: ADDM ]
# Prevalence Estimates by State [ Source: ADDM ] , aggregated for different years
p <- ggplot(ASD_State_ADDM, aes(x = reorder(State_Full2, Prevalence, FUN = median), # Order States by median of Prevalence
y = Prevalence)) +
geom_boxplot(aes(fill = reorder(State_Full2, Prevalence, FUN = median))) + # fill color by State
scale_fill_discrete(guide = guide_legend(title = "US. States")) + # Legend Name
# geom_boxplot(fill = 'darkslategrey', alpha = 0.2) +
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_discrete(name = "") +
ggtitle("Prevalence Estimates by State [ Source: ADDM ]") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"),
legend.position = 'none') +
coord_flip() + # Rotate chart
geom_jitter(alpha = 0.5, position = position_jitter(width = 0.1)) # Add actual data points
# Show plot
p
# Theme of the economist magazine:
# p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
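The box plot above relies on `reorder(State_Full2, Prevalence, FUN = median)` to sort States on the axis. A minimal sketch with invented groups shows what that call returns.

```r
# Invented groups and values
toy_state <- c("A", "A", "B", "B", "C", "C")
toy_prev  <- c(10, 12, 4, 6, 7, 30)

# Factor whose levels are sorted by the per-group median of toy_prev
toy_ordered <- reorder(toy_state, toy_prev, FUN = median)
levels(toy_ordered)

# The per-group medians that drive the ordering
tapply(toy_prev, toy_state, median)
```

ggplot2 draws discrete axes in level order, so the group with the lowest median comes first (at the bottom after `coord_flip()`).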
<h3>
Quiz:
</h3>
<p>
Create <span style="color:blue">Prevalence Estimates by State [ Source: XXXX ]</span> for other three data sources:
</p>
# Write your code below and press Shift+Enter to execute
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> No. Children Surveyed by State [ Source: ADDM ] [Year 2014]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
Visualise: No. Children Surveyed by State [ Source: ADDM ] [Year 2014]
# All State Prevalence data with: Source == 'addm' & Year == 2014
# filter using dataframe: ASD_State_ADDM
ASD_State_Subset <- subset(ASD_State_ADDM, Year == 2014)
# or filter using dataframe: ASD_State
ASD_State_Subset <- subset(ASD_State, Source == 'addm' & Year == 2014)
# Bar plot/chart for < No. Children surveyed by State [ADDM] [Year 2014] >
p <- ggplot(ASD_State_Subset, aes(x = reorder(State_Full1, Denominator, FUN = median), # Order States by median of Denominator
y = Denominator)) +
geom_bar(stat="identity", aes(fill = reorder(State_Full1, Denominator, FUN = median))) + # fill color by State
scale_fill_discrete(guide = guide_legend(title = "US. States")) + # Legend Name
scale_x_discrete(name = "US. States") +
scale_y_continuous(name = "No. Children (Denominator)") +
ggtitle("No. Children Surveyed by State [ Source: ADDM ] [Year 2014]") +
# geom_text(aes(label=Denominator), vjust=1.6, color="darkslategrey", size=3.5) + # Show data label inside bars
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"),
legend.position="none")
# Show plot
p
# Theme of the economist magazine:
# p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Quiz:
</h3>
<p>
Create <span style="color:blue">No. Children Surveyed by State [ Source: XXXX ] [Year CCYY]</span> for other data sources & years:
</p>
# Write your code below and press Shift+Enter to execute
<h3>
Quiz:
</h3>
<p>
Create <span style="color:blue">No. ASD Children by State [ Source: XXXX ] [Year CCYY]</span> for other data sources & years:
</p>
<p>
Hint: Use variable: ASD_State_ADDM$Numerator_ASD
</p>
# Write your code below and press Shift+Enter to execute
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> Prevalence Estimates with 95% CI by State [ Source: ADDM ] [ Year 2014 ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
Visualise: Prevalence Estimates with 95% CI by State [ Source: ADDM ] [ Year 2014 ]
# ASD_State_Subset <- subset(ASD_State_ADDM, Year == 2014)
# or
# ASD_State_Subset <- subset(ASD_State, Source == 'addm' & Year == 2014)
# Point plot/chart
p = ggplot(ASD_State_Subset, aes(x = reorder(State_Full1, Prevalence, median), # Order States by median of Prevalence
y = Prevalence)) +
geom_point(stat="identity", aes(colour = reorder(State_Full1, Prevalence, median)), size = 10, alpha = 0.1, pch = 15) + # colour points by State
scale_colour_discrete(guide = guide_legend(title = "US. States")) + # Legend Name
scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(10, 35, 5),
limits=c(10, 35)) +
scale_x_discrete(name = "US. States") +
ggtitle("Prevalence Estimates with 95% CI by State [ Source: ADDM ] [ Year 2014 ]") +
theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"),
legend.position = 'none') +
geom_text(aes(label=Prevalence), hjust=0.5, color="black", size=3.5) # Show data label on each point
# Show plot
p
# Add Lower.CI
p = p + geom_point(data = ASD_State_Subset, aes(x = reorder(State_Full1, Prevalence, median), y = Lower.CI,
shape=Source # point shape
),
size = 2 # point size
) +
# geom_text(aes(label=Lower.CI), hjust=-0.1, vjust=3, color="darkslategrey", size=2.5) + # Show data label inside bars
scale_shape_manual(values=3) # manual define point shape
# Show plot
p
# Add Upper.CI
p = p + geom_point(data = ASD_State_Subset, aes(x = reorder(State_Full1, Prevalence, median), y = Upper.CI,
shape=Source # point shape
),
size = 2 # point size
)
# geom_text(aes(label=Upper.CI), hjust=-0.1, vjust=-3, color="darkslategrey", size=2.5) # Show data label inside bars
# Show plot
p
# theme of the economist magazine:
# p + theme_economist() + scale_colour_economist() + scale_colour_discrete(guide = guide_legend(title = "US. States")) + theme(legend.position = 'none')
# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + scale_colour_discrete(guide = guide_legend(title = "US. States")) + theme(legend.position = 'none')
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
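The chart above marks the lower and upper 95% CI bounds with two extra point layers. As a hedged alternative (not the approach used in this notebook), `geom_errorbar()` can draw both bounds in a single layer. The sketch below uses the six ADDM 2014 rows printed later in this notebook; the names `toy` and `p_ci` are illustrative.

```r
library(ggplot2)

# Six ADDM 2014 rows as printed later in this notebook
toy <- data.frame(
  State = c("AZ", "AR", "CO", "GA", "MD", "MN"),
  Prevalence = c(14, 13.1, 13.9, 17, 20, 24),
  Lower.CI = c(12.6, 12, 12.8, 15.9, 17.4, 21.1),
  Upper.CI = c(15.5, 14.2, 15.1, 18.1, 22.9, 27.2)
)

# One error bar per state spans Lower.CI to Upper.CI; the point marks the estimate
p_ci <- ggplot(toy, aes(x = reorder(State, Prevalence, median), y = Prevalence)) +
  geom_errorbar(aes(ymin = Lower.CI, ymax = Upper.CI),
                width = 0.2, colour = "darkslategrey") +
  geom_point(size = 3) +
  scale_y_continuous(name = "Prevalence per 1,000 Children", limits = c(10, 35)) +
  scale_x_discrete(name = "US. States")
p_ci
```

The error bar makes the width of each interval directly comparable across states, which is harder to read from separate markers.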
<h3>
Quiz:
</h3>
<p>
Create <span style="color:blue">Prevalence Estimates with 95% CI by State [ Source: ADDM ] [Year CCYY]</span> for other data sources & years:
</p>
# Write your code below and press Shift+Enter to execute
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> Prevalence Estimates over Year [ Source: ADDM ] [ State: AZ-Arizona ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
Visualise: Prevalence Estimates over Year [ Source: ADDM ] [ State: AZ-Arizona ]
# All year/time Prevalence data with: Source_UC == 'ADDM' & State_Full2 == 'AZ-Arizona'
ASD_State_Subset <- subset(ASD_State, Source_UC == 'ADDM' & State_Full2 == 'AZ-Arizona')
# Line plot/chart for < State ASD Prevalence [ADDM] [AZ-Arizona] >
p <- ggplot(ASD_State_Subset, aes(x = Year, y = Prevalence))
# Select (add) line chart type:
p <- p + geom_line(aes(color = State_Full2),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5)
# Select (add) points to chart:
p <- p + geom_point(aes(color = State_Full2),
size=3,
shape=20,
alpha=0.5)
# Customize legend name:
p <- p + labs(color = "US. State")
# Adjust x and y axis, scale, limit and labels:
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_continuous(name = "Year",
breaks = seq(2000, 2016, 1),
limits = c(2000, 2016))
# Customize chart title:
p <- p + ggtitle("Prevalence Estimates over Year [ Source: ADDM ] [ State: AZ-Arizona ]")
# Customize text style of chart title and axis titles:
p <- p + theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"))
# Show plot
p
# Theme of the economist magazine:
p + theme_economist() + scale_colour_economist()
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> Prevalence Estimates over Year [ Source: ADDM ] [ State: ALL ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
Visualise: Prevalence Estimates over Year [ Source: ADDM ] [ State: ALL ]
p <- ggplot(ASD_State_ADDM, aes(x = Year, y = Prevalence))
# Select (add) line chart type:
p <- p + geom_line(aes(color = State_Full2),
linetype = "solid", # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
size=1,
alpha=0.5)
# Select (add) points to chart:
p <- p + geom_point(aes(color = State_Full2),
size=3,
shape=20,
alpha=0.5)
# Show plot
# p
# Customize legend name:
p <- p + labs(color = "US. State")
# Adjust x and y axis, scale, limit and labels:
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
breaks = seq(0, 30, 5),
limits=c(0, 30)) +
scale_x_continuous(name = "Year (2000 - 2016)",
breaks = seq(2000, 2016, 1),
limits = c(2000, 2016))
# Customize chart title:
p <- p + ggtitle("Prevalence Estimates over Year [ Source: ADDM ] [ State: ALL ]")
# Customize text style of chart title and axis titles, and legend position:
p <- p + theme(title = element_text(face = 'bold.italic', color = "darkslategrey"),
axis.title = element_text(face = 'plain', color = "darkslategrey"),
legend.position="right")
# Show plot
p
# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + scale_colour_discrete(guide = guide_legend(title = "US. States"))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
Split chart by state
# Show plot in facet_grid
p + facet_grid(. ~ State) + # One panel per State, laid out in columns
theme(legend.position = "none", # Hide legend
axis.text.x=element_blank(), # Hide x-axis text
axis.ticks.x=element_blank(), # Hide x-axis ticks
panel.background = element_blank(), # Remove panel background
panel.grid.major = element_line(size = 0.1, linetype = 1, colour = "lightgrey")
)
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
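The `geom_path` message above appears when a group (here, a state) contains a single observation, so there is nothing to connect with a line. One way to handle this, sketched below under the assumption that single-year states can be left out of the line chart, is to keep only states with at least two observations. The rows are made up for illustration and the names `toy`, `n_obs` and `toy_lines` are hypothetical.

```r
# Made-up rows for illustration only: state "B" has a single observation
toy <- data.frame(
  State = c("A", "A", "A", "B", "C", "C"),
  Year = c(2010, 2012, 2014, 2014, 2012, 2014),
  Prevalence = c(10, 12, 14, 20, 15, 16)
)

# Count observations per state and attach the count to every row
toy$n_obs <- ave(toy$Year, toy$State, FUN = length)

# Keep only states with at least two observations, so each line has two ends
toy_lines <- subset(toy, n_obs >= 2)
unique(toy_lines$State)
```

The filtered data frame can then be passed to `geom_line()`, while `geom_point()` can still use the full data so that single-year states remain visible.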
<h3>
Data Visualisation (Enhanced) - Plotting on Map
</h3>
# ----------------------------------
# EDA - Visualisation on map
# ----------------------------------
if(!require(usmap)){install.packages("usmap")}
## Loading required package: usmap
library(usmap) # usmap: Mapping the US
<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ CDC ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span>
</h3>
<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span>
</h3>
Let’s review data availability by data source and year:
ASD_State_ADDM in Years: 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014
ASD_State_MEDI in Years: 2000 ~ 2012
ASD_State_NSCH in Years: 2004, 2008, 2012, 2016
ASD_State_SPED in Years: 2000 ~ 2016
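The availability summary above can also be derived from the data rather than typed by hand. The sketch below assumes a data frame with `Source` and `Year` columns, as in `ASD_State`; the rows are made up for illustration and `toy`, `availability` and `years_by_source` are hypothetical names.

```r
# Made-up rows standing in for ASD_State (columns Source and Year only)
toy <- data.frame(
  Source = c("addm", "addm", "nsch", "nsch", "nsch", "addm"),
  Year = c(2012, 2014, 2004, 2008, 2004, 2014)
)

# Cross-tabulate the number of rows for each source and year
availability <- table(toy$Source, toy$Year)
availability

# Sorted unique years for each source, as a named list
years_by_source <- lapply(split(toy$Year, toy$Source),
                          function(y) sort(unique(y)))
years_by_source
```

With the notebook's data, replacing `toy` with `ASD_State` should list the years available for each of the four sources.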
<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span> [ Source: ADDM ] [ Year: 2014 ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
Prepare US State level data: [ Source: ADDM ] [ Year: 2014 ]
# Prepare data - addm 2014
Map_Data_Source = 'addm' # Available values (lowercase): 'addm', 'medi', 'nsch', 'sped'
Map_Data_Value = 'Prevalence' # Quoted name of a numeric column; a non-numeric column raises "Error: Discrete value supplied to continuous scale"
# Uncomment one of the lines below to map the prevalence of a specific group instead:
# Map_Data_Value = 'Male.Prevalence'
# Map_Data_Value = 'Female.Prevalence'
# Map_Data_Value = 'Asian.or.Pacific.Islander.Prevalence'
Map_Data_Year = 2014 # A year available for the chosen source
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
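The plot_usmap() function requires the input data to have a column named state or fips (lowercase; column names are case sensitive):
state: Name or abbreviation of a US state (this notebook uses abbreviations such as AZ)
fips: FIPS code for either a US state or county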
https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
https://cran.r-project.org/web/packages/usmap/usmap.pdf
# The usmap package/function requires input data to have a column of 'state', or 'fips'. (case sensitive)
ASD_State_Subset$state = ASD_State_Subset$State
# Glance
head(ASD_State_Subset)
## # A tibble: 6 x 40
## State Denominator Prevalence Lower.CI Upper.CI Year Source Source_Full1
## <fct> <int> <dbl> <dbl> <dbl> <int> <fct> <fct>
## 1 AZ 24952 14 12.6 15.5 2014 addm Autism & De…
## 2 AR 39992 13.1 12 14.2 2014 addm Autism & De…
## 3 CO 41128 13.9 12.8 15.1 2014 addm Autism & De…
## 4 GA 51161 17 15.9 18.1 2014 addm Autism & De…
## 5 MD 9955 20 17.4 22.9 2014 addm Autism & De…
## 6 MN 9767 24 21.1 27.2 2014 addm Autism & De…
## # … with 32 more variables: State_Full1 <fct>, State_Full2 <fct>,
## # Numerator_ASD <int>, Numerator_NonASD <int>, Proportion <dbl>,
## # Chi_Wilson_Corrected_Lower.CI <dbl>, Chi_Wilson_Corrected_Upper.CI <dbl>,
## # Male.Prevalence <dbl>, Male.Lower.CI <dbl>, Male.Upper.CI <dbl>,
## # Female.Prevalence <dbl>, Female.Lower.CI <dbl>, Female.Upper.CI <dbl>,
## # Non.hispanic.white.Prevalence <dbl>, Non.hispanic.white.Lower.CI <dbl>,
## # Non.hispanic.white.Upper.CI <dbl>, Non.hispanic.black.Prevalence <dbl>,
## # Non.hispanic.black.Lower.CI <dbl>, Non.hispanic.black.Upper.CI <dbl>,
## # Hispanic.Prevalence <dbl>, Hispanic.Lower.CI <dbl>,
## # Hispanic.Upper.CI <dbl>, Asian.or.Pacific.Islander.Prevalence <dbl>,
## # Asian.or.Pacific.Islander.Lower.CI <dbl>,
## # Asian.or.Pacific.Islander.Upper.CI <dbl>, State_Region <fct>,
## # Source_UC <fct>, Source_Full3 <fct>, Prevalence_Risk2 <ord>,
## # Prevalence_Risk4 <ord>, Year_Factor <ord>, state <fct>
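As noted above, `plot_usmap()` matches rows to map regions through a lowercase `state` or `fips` column. The following is a small sketch of preparing and checking that key column before plotting. The three rows come from the output above; the `fips` step assumes the `usmap` package is installed and is skipped otherwise.

```r
# Three ADDM 2014 rows taken from the output above
toy <- data.frame(
  State = c("AZ", "AR", "CO"),
  Prevalence = c(14, 13.1, 13.9)
)

# plot_usmap() looks for a lowercase 'state' (or 'fips') column
toy$state <- as.character(toy$State)

# Fail early if neither key column is present
stopifnot(any(c("state", "fips") %in% names(toy)))

# Optional: derive state FIPS codes when the usmap package is available
if (requireNamespace("usmap", quietly = TRUE)) {
  toy$fips <- vapply(toy$state, usmap::fips, character(1), USE.NAMES = FALSE)
}
toy
```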
Visualise: Prevalence Estimates by Geographic Area [ Source: ADDM ] [ Year: 2014 ]
# Show data on map
p_map_addm_2014 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value,
color = "white", # map line colour
labels = TRUE, # State name shown
label_color = 'white' # State name colour
) +
scale_fill_continuous(
na.value = "lightgrey", # Set colour with no State data
low="lightblue1", high = "darkblue",
name = "Prevalence\nper 1,000\nChildren",
limits=c(0, 40) #same colour levels/limits for plots
) +
labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
subtitle = 'https://www.cdc.gov/ncbddd/autism'
) +
theme(panel.background = element_rect(color = "white", fill = "white"),
legend.position = "right")
# Show map
p_map_addm_2014
# Dynamic map
p_dynamic <- p_map_addm_2014
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span> [ Source: NSCH ] [ Year: 2004, 2008, 2012, 2016 ]
</h3>
Prepare US State level data: [ Source: NSCH ] [ Year: ALL ]
Map_Data_Source = 'nsch' # Available values (lowercase): 'addm', 'medi', 'nsch', 'sped'
Map_Data_Value = 'Prevalence' # Quoted name of a numeric column; a non-numeric column raises "Error: Discrete value supplied to continuous scale"
Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2004 ]
# Prepare data - nsch 2004
Map_Data_Year = 2004 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
# Plot on map
p_map_nsch_2004 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value,
color = "white", labels = FALSE, label_color = 'white') +
scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue",
name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
subtitle = 'https://www.cdc.gov/ncbddd/autism') +
theme(panel.background = element_rect(color = "white", fill = "white"),
legend.position = "right")
p_map_nsch_2004
Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2008 ]
# Prepare data - nsch 2008
Map_Data_Year = 2008 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
p_map_nsch_2008 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value,
color = "white", labels = FALSE, label_color = 'white') +
scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue",
name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
subtitle = 'https://www.cdc.gov/ncbddd/autism') +
theme(panel.background = element_rect(color = "white", fill = "white"),
legend.position = "right")
p_map_nsch_2008
Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2012 ]
# Prepare data - nsch 2012
Map_Data_Year = 2012 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
p_map_nsch_2012 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value,
color = "white", labels = FALSE, label_color = 'white') +
scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue",
name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
subtitle = 'https://www.cdc.gov/ncbddd/autism') +
theme(panel.background = element_rect(color = "white", fill = "white"),
legend.position = "right")
p_map_nsch_2012
Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2016 ]
# Prepare data - nsch 2016
Map_Data_Year = 2016 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
p_map_nsch_2016 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value,
color = "white", labels = FALSE, label_color = 'white') +
scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue",
name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
subtitle = 'https://www.cdc.gov/ncbddd/autism') +
theme(panel.background = element_rect(color = "white", fill = "white"),
legend.position = "right")
p_map_nsch_2016
# Dynamic map
p_dynamic <- p_map_nsch_2016 # [ Source: NSCH ] [ Year: 2016 ]
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
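The four NSCH maps above repeat the same steps with only the year changed. One possible way to reduce the repetition is to wrap the steps in a function. The sketch below is not part of the original notebook: `plot_prevalence_map()` and its parameter names are hypothetical, it assumes the `usmap` and `ggplot2` packages are installed, and it is exercised on six ADDM 2014 rows taken from the output earlier in this notebook.

```r
library(ggplot2)
library(usmap)

# Hypothetical helper: one prevalence map for a given source, year and measure
plot_prevalence_map <- function(data, data_source, data_year, value = "Prevalence") {
  map_data <- subset(data, Source == data_source & Year == data_year)
  map_data$state <- map_data$State # plot_usmap() needs a lowercase 'state' column
  plot_usmap(data = map_data, values = value, color = "white") +
    scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue",
                          name = "Prevalence\nper 1,000\nChildren",
                          limits = c(0, 40)) + # same colour limits for every map
    labs(title = paste("Prevalence Estimates by Geographic Area",
                       "\n[ Measure :", value, "] [ Source :", data_source,
                       "] [ Year :", data_year, "]")) +
    theme(legend.position = "right")
}

# Six ADDM 2014 rows taken from the output earlier in this notebook
toy <- data.frame(
  State = c("AZ", "AR", "CO", "GA", "MD", "MN"),
  Prevalence = c(14, 13.1, 13.9, 17, 20, 24),
  Source = "addm",
  Year = 2014
)

p_map_toy <- plot_prevalence_map(toy, "addm", 2014)
p_map_toy
```

With the notebook's data, a call such as `lapply(c(2004, 2008, 2012, 2016), function(y) plot_prevalence_map(ASD_State, "nsch", y))` should return the four NSCH maps as a list.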
Combine multiple plots to show in one page/screen:
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
# ----------------------------------
# Combine multiple plots
# ----------------------------------
if(!require(cowplot)){install.packages("cowplot")}
## Loading required package: cowplot
##
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
## default ggplot2 theme anymore. To recover the previous
## behavior, execute:
## theme_set(theme_cowplot())
## ********************************************************
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
##
## theme_map
library('cowplot')
cowplot::plot_grid(
p_map_nsch_2004,
p_map_nsch_2008,
p_map_nsch_2012,
p_map_nsch_2016,
nrow = 2)
Export current plot as image file:
# ----------------------------------
# Export current plot as image file
# ----------------------------------
ggsave("plot Map Prevalence Estimates by Geographic Area [NSCH] [2004-2016].png",
width = 60, height = 30, units = 'cm')
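When no plot is given, `ggsave()` saves the last plot displayed (its default is `plot = last_plot()`), so the call above depends on the combined grid being the most recent plot. A hedged sketch of the more explicit form follows: it stores a plot in a variable and passes it to `ggsave()`. The plot, the object names and the temporary file path are illustrative only.

```r
library(ggplot2)

# Any ggplot object; in this notebook it could be the result of cowplot::plot_grid()
p_example <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x, y)) +
  geom_line()

# Write to a temporary folder so the sketch does not clutter the working directory
out_file <- file.path(tempdir(), "example_plot.png")

# Passing 'plot' explicitly avoids relying on which plot was displayed last
ggsave(out_file, plot = p_example, width = 20, height = 10, units = "cm")
file.exists(out_file)
```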
<h3>
What to submit?
</h3>
<p>
Choose one of the visualisations/charts below and reconstruct it carefully in R.
</p>
<p>
Optionally, enhance it with additional data dimensions so that it improves on the original chart.
</p>
https://www.cdc.gov/ncbddd/autism/data/index.html
# Write your code below and press Shift+Enter to execute
Connect with the author:
This notebook was written by GU Zhan (Sam).
Sam is currently a lecturer at the Institute of Systems Science, National University of Singapore. He devotes himself to pedagogy & andragogy, and is passionate about inspiring the next generation of artificial intelligence lovers and leaders.
Copyright © 2020 GU Zhan
This notebook and its source code are released under the terms of the MIT License.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
<h3>
Interactive workshops: < Learning R inside R > using swirl() (in R/RStudio)
</h3>
https://github.com/telescopeuser/S-SB-Workshop
<h3>
Correlation of Numeric Variables
</h3>
# ----------------------------------
# Correlation of Numeric Variables
# ----------------------------------
cor_df = select_if(ASD_State, is.numeric) # Select only numeric variables
cor_df = cor_df[, colSums(is.na(cor_df)) == 0] # Select variables without NA
# Compute correlation matrix for No-NA numeric variables:
cor_table = cor(cor_df)
cor_table
## Denominator Prevalence Lower.CI Upper.CI
## Denominator 1.00000000 -0.1374662 -0.07863304 -0.17389486
## Prevalence -0.13746621 1.0000000 0.95813468 0.96568034
## Lower.CI -0.07863304 0.9581347 1.00000000 0.85132455
## Upper.CI -0.17389486 0.9656803 0.85132455 1.00000000
## Year 0.02851671 0.6400295 0.67690938 0.56480277
## Numerator_ASD 0.82429404 0.1121787 0.21429644 0.02005452
## Numerator_NonASD 0.99999025 -0.1392238 -0.08080949 -0.17516773
## Proportion -0.13735462 0.9999677 0.95851437 0.96524017
## Chi_Wilson_Corrected_Lower.CI -0.08734046 0.9761979 0.99597141 0.88837741
## Chi_Wilson_Corrected_Upper.CI -0.17380524 0.9798117 0.88384420 0.99561482
## Year Numerator_ASD Numerator_NonASD
## Denominator 0.02851671 0.82429404 0.99999025
## Prevalence 0.64002950 0.11217865 -0.13922381
## Lower.CI 0.67690938 0.21429644 -0.08080949
## Upper.CI 0.56480277 0.02005452 -0.17516773
## Year 1.00000000 0.29628163 0.02638864
## Numerator_ASD 0.29628163 1.00000000 0.82178563
## Numerator_NonASD 0.02638864 0.82178563 1.00000000
## Proportion 0.64020778 0.11251687 -0.13911415
## Chi_Wilson_Corrected_Lower.CI 0.67167964 0.19523745 -0.08942415
## Chi_Wilson_Corrected_Upper.CI 0.58775086 0.03675270 -0.17520779
## Proportion Chi_Wilson_Corrected_Lower.CI
## Denominator -0.1373546 -0.08734046
## Prevalence 0.9999677 0.97619788
## Lower.CI 0.9585144 0.99597141
## Upper.CI 0.9652402 0.88837741
## Year 0.6402078 0.67167964
## Numerator_ASD 0.1125169 0.19523745
## Numerator_NonASD -0.1391141 -0.08942415
## Proportion 1.0000000 0.97646889
## Chi_Wilson_Corrected_Lower.CI 0.9764689 1.00000000
## Chi_Wilson_Corrected_Upper.CI 0.9796180 0.91344122
## Chi_Wilson_Corrected_Upper.CI
## Denominator -0.1738052
## Prevalence 0.9798117
## Lower.CI 0.8838442
## Upper.CI 0.9956148
## Year 0.5877509
## Numerator_ASD 0.0367527
## Numerator_NonASD -0.1752078
## Proportion 0.9796180
## Chi_Wilson_Corrected_Lower.CI 0.9134412
## Chi_Wilson_Corrected_Upper.CI 1.0000000
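The correlation matrix above is computed only on columns with no missing values, which drops every variable that has even one NA. As a hedged alternative (not the approach used in this notebook), `cor()` can keep all columns and use the complete pairs of observations for each pair of variables. The data below are made up for illustration and the object names are hypothetical.

```r
# Made-up data for illustration: column 'c' has one missing value
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),
  c = c(5, NA, 3, 2, 1)
)

# Approach used above: keep only the columns without any NA
complete_cols <- toy[, colSums(is.na(toy)) == 0]
cor(complete_cols)

# Alternative: keep every column; each correlation uses the rows
# where both variables of the pair are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
cor_pairwise
```

Pairwise correlations can rest on different subsets of rows, so they should be compared with some care.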
# ----------------------------------
# Visualise Correlation Matrix
# ----------------------------------
if(!require(corrplot)){install.packages("corrplot")}
## Loading required package: corrplot
## corrplot 0.84 loaded
library('corrplot')
# Sort on decreasing correlations with Prevalence
cor_table_sorted <- as.matrix(sort(cor_table[,'Prevalence'], decreasing = TRUE))
#
cor_table_sorted
## [,1]
## Prevalence 1.0000000
## Proportion 0.9999677
## Chi_Wilson_Corrected_Upper.CI 0.9798117
## Chi_Wilson_Corrected_Lower.CI 0.9761979
## Upper.CI 0.9656803
## Lower.CI 0.9581347
## Year 0.6400295
## Numerator_ASD 0.1121787
## Denominator -0.1374662
## Numerator_NonASD -0.1392238
# Select correlated variables based on threshold:
#cor_var_high <- names(which(apply(cor_table_sorted, 1, function(x) abs(x)>0.25)))
cor_var_high <- names(which(apply(cor_table_sorted, 1, function(x) abs(x)>0.05)))
#
cor_var_high
## [1] "Prevalence" "Proportion"
## [3] "Chi_Wilson_Corrected_Upper.CI" "Chi_Wilson_Corrected_Lower.CI"
## [5] "Upper.CI" "Lower.CI"
## [7] "Year" "Numerator_ASD"
## [9] "Denominator" "Numerator_NonASD"
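With a threshold of 0.05 every variable is kept, as the output above shows. The sketch below illustrates the effect of the stricter 0.25 threshold (the commented-out option above) using a plain named vector, which avoids the `apply()` call on a one-column matrix. The values are copied from the sorted correlation table above; the names `cor_with_prevalence` and `threshold` are illustrative.

```r
# Correlations with Prevalence, copied from the sorted table above
cor_with_prevalence <- c(
  Prevalence = 1.0000000,
  Proportion = 0.9999677,
  Chi_Wilson_Corrected_Upper.CI = 0.9798117,
  Chi_Wilson_Corrected_Lower.CI = 0.9761979,
  Upper.CI = 0.9656803,
  Lower.CI = 0.9581347,
  Year = 0.6400295,
  Numerator_ASD = 0.1121787,
  Denominator = -0.1374662,
  Numerator_NonASD = -0.1392238
)

# Keep variables whose absolute correlation exceeds the threshold
threshold <- 0.25
cor_var_high <- names(cor_with_prevalence)[abs(cor_with_prevalence) > threshold]
cor_var_high
```

At this threshold the three count variables (Numerator_ASD, Denominator, Numerator_NonASD) drop out, leaving the seven variables from Prevalence to Year.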
# Visualise:
cor_table_plot <- cor_table[cor_var_high, cor_var_high]
# cor_table_plot
#
corrplot(cor_table_plot, tl.col="black", tl.pos = "lt")